Meeting Hadley!
Last Monday, I had the pleasure of attending a talk given by Hadley Wickham at LondonR, held at one of their usual venues, the UCL Darwin Lecture Theatre.
For most readers of this blog, Hadley needs no introduction; the running joke amongst R users that the tidyverse would have been known as the hadleyverse had it not been rebranded says it all. I had really looked forward to this event, as it’s always an interesting experience to meet in real life the people you seem to know so well, or have heard so much about, virtually. Another occasion I can recall was the keynote address at EARL by Hilary Parker, whom I know through her brilliant data science podcast Not So Standard Deviations (co-hosted with Roger Peng).
In this post, I’m going to briefly summarize() (sorry 😆) what Hadley covered in his talk, and some of my thoughts on what he said. (potential spoiler alert?)
Tidyverse: the greatest hits
This was perhaps the busiest LondonR session I’ve ever been to, but understandably so! The lecture hall usually has a fair number of free seats left, but on this occasion latecomers struggled to find one. Speaking to others around me, nobody seemed to know what Hadley’s talk was going to be about. Apparently this is Hadley’s new talk, titled Tidyverse: the greatest hits.
But this turned out to be one of those attention-grabbing titles: what Hadley really planned to talk about was the greatest mistakes of the tidyverse. As he put it, whilst you might intuitively expect good developers and R coders to make fewer mistakes, in his case it’s more that he makes many mistakes as fast as possible, which, I imagine, is partly responsible for his prolific body of work in R. Unfortunately, some of these “mistakes” have become “permanent” within the tidyverse, and his talk explained some of the things that we R users have questioned when using tidyverse packages.
Hadley mentioned a number of these “permanent mistakes”, and the two that probably resonate most with tidyverse users are:
- the conflicting function names with stats::filter() and stats::lag()
- the use of the + operator rather than %>% in ggplot2
In the first case, it is the classic programming “problem” of naming variables / functions. A name may seem intuitive or sensible initially, but if you don’t think it through, it can come back and bite you in the future. Hadley’s argument for choosing filter() as the dplyr verb for filtering rows, despite the existence of stats::filter(), is that the latter has relatively niche applications. The documentation of stats::filter() reads:
Applies linear filtering to a univariate time series or to each series separately of a multivariate time series.
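To make the clash concrete, here is a minimal sketch, assuming dplyr is installed. The vector x and the data frame df are made up for illustration, and both functions are called with an explicit namespace prefix so that it is unambiguous which one runs.

```r
# Illustration data (hypothetical, not from the talk)
x <- c(1, 2, 3, 4, 5)
df <- data.frame(id = 1:5, value = x)

# stats::filter(): linear filtering of a series,
# here a centred 3-point moving average
ma <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter(): keep the rows of a data frame that meet a condition
big <- dplyr::filter(df, value > 3)

print(ma)
print(big)
```

Once library(dplyr) is attached, a bare filter() call refers to the dplyr version, and the time series function is still reachable as stats::filter().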
I’ve never used stats::filter() myself and personally find filter() to be quite an intuitive verb, so I’m not too bothered by this one. Another similar function-naming “mistake” that Hadley talked about is the lack of intuitiveness of gather() and spread(): it isn’t immediately clear to an unfamiliar tidyr user which of those functions converts data from long to wide format, and which does the reverse. Unlike dplyr::filter(), where there are no plans for a new filtering / subsetting function, the development version of tidyr will have two new functions for pivoting data frames, pivot_wider() and pivot_longer(), which remove the ambiguity you get with spread() and gather(). Note that there isn’t any intent to deprecate spread() and gather(); you simply get two new alternatives which make code easier to read and use.
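As a rough sketch of how the pivoting functions read in practice, the example below assumes tidyr 1.0.0 or later, where they are exported as pivot_longer() and pivot_wider(). The data frame wide and the column names key and value are made up for illustration.

```r
library(tidyr)

# Hypothetical wide data: one row per id, one column per measurement
wide <- data.frame(
  id = c("a", "b"),
  x = c(1, 2),
  y = c(3, 4),
  stringsAsFactors = FALSE
)

# Wide to long: the function name states the direction
long <- pivot_longer(wide, cols = c(x, y), names_to = "key", values_to = "value")

# Long back to wide
wide_again <- pivot_wider(long, names_from = key, values_from = value)

print(long)
print(wide_again)
```

The older equivalents would be gather(wide, key, value, x, y) and spread(long, key, value), where the direction has to be remembered rather than read off the name.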
The other interesting mistake that Hadley talked about is the + operator in ggplot2. To put it simply, this refers to the problem that whilst the rest of the tidyverse uses the pipe operator %>% to chain analysis steps together, ggplot2 alone uses a different operator. Here’s a simple illustration of the problem:
library(dplyr)
library(ggplot2)

iris %>% # You can pipe
  select(Species, Sepal.Length, Sepal.Width) %>% # Still piping
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + # You cannot use the pipe here
  geom_point()

If you use %>% instead of the + operator once you start adding ggplot2 layers, you get the following error message:
Error: `mapping` must be created by `aes()`
Did you use %>% instead of +?
This is very much a mistake of legacy, because the magrittr pipe %>% was not in use when ggplot2 was written. Again, this feels like a quirk or inconvenience that tidyverse users will need to live with, but from a macro perspective ggplot2 is still a fantastic package with powerful functionality that played a significant role in popularising the use of R.
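The reason the pipe fails can be seen in a small sketch, assuming dplyr and ggplot2 are installed. The pipe passes its left-hand side as the first argument of the next call, and the first argument of geom_point() is mapping, so the plot object ends up where an aes() specification is expected. The names p and piped are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Works: pipe the data into ggplot(), then add the layer with +
p <- iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# Fails: piping the plot into geom_point() hands it over as `mapping`.
# tryCatch() captures the error message instead of stopping the script.
piped <- tryCatch(
  ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) %>% geom_point(),
  error = function(e) conditionMessage(e)
)

print(piped)
```

Because %>% binds more tightly than +, the whole piped expression in the first example is evaluated before geom_point() is added to it, which is why mixing the two operators in that order works.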
Hannah Frick from Mango Solutions lists a couple more “mistakes” that Hadley mentioned during his talk:
The greatest tidyverse mistakes:
💥 no pipe in ggplot2
💥 overwriting filter and lag
💥 using the . in arg names
💥 tidyeval pushed too early
💥 tidyverse as a name made some people think it's meant to be used in isolation - nah, use it with whatever in #rstats is useful for you!
Hannah Frick (@hfcfrick), August 19, 2019